# set the default ggplot theme for the whole notebook
theme_set(theme_minimal())
# set defaults for the appearance/suppression of code chunks in the markdown doc
knitr::opts_chunk$set(message = FALSE, echo = FALSE)
# set green to grey colour scale for use in tables and graphs
colour_palette <- scales::seq_gradient_pal(low = "palegreen4", high = "grey")
colour_scale <- colour_palette(seq(0, 1, length.out = 5))
# set the background colour for the table
background_colour <- "#EDEDED"
Audience:
Style:
Friends of the Earth (FoE) have recently released a report focused on “England’s Green Space Gap.” The headline finding of the report is that one in five people in England live in areas where it is difficult to access green space. The report also provides a holistic overview of why green space is so important, by highlighting how individuals and communities benefit from having access to both public and private green space. These benefits stretch far beyond the natural environment itself, and encompass a range of social, health and economic benefits.
As part of the research underpinning the Green Space Gap report, Friends of the Earth have developed a new approach for classifying the extent to which neighbourhoods (or Middle Super Output Areas, MSOAs, in the terminology of the administrative geography) across England experience green space deprivation. Neighbourhoods are classified into five groups, with group A including the least green space deprived neighbourhoods and group E including the most green space deprived.
Friends of the Earth have released the dataset that they developed and used to classify green space deprivation within the Green Space Gap report. In this notebook I plan to conduct an exploratory data analysis using this Friends of the Earth dataset. Before doing so, I think it might be helpful to outline how Friends of the Earth processed the dataset. The process is outlined in the figure below and incorporated the following steps:
Producing the Friends of the Earth Green Space Deprivation ratings.
n.b. In the report, Friends of the Earth draw on the Index of Multiple Deprivation (IMD) dataset to explore the relationship between the green space deprivation ratings and demographic factors including ethnicity and income.
Reading the Green Space Gap report and exploring the associated dataset, I was struck by a number of questions about the nature and scope of green space deprivation in England. I thought that these questions might be a good basis for my exploratory data analysis.
What is the scale of the green space deprivation problem in England?
Is green space deprivation an urban problem?
How is green space deprivation distributed across regions in England?
What can the dataset tell us about what green space deprivation looks like in England?
Below I address each of these questions in turn, with the aim of extending the analysis of the data presented in the report. By doing so, I hope to contribute to the wider debate on maintaining and extending access to green space during the post-Covid recovery.
Ahead of moving on to the exploratory data analysis itself, I thought it would be helpful to very briefly document the datasets I used. This includes the Friends of the Earth dataset, and additional datasets from the ONS which proved interesting or helpful in the context of my exploratory data analysis. In particular, I thought it was worth recording the versions of the datasets used where multiple versions are available from the ONS.
| variable_name | file_name | notes | url |
|---|---|---|---|
| green_space | (FOE) Green Space Consolidated Data - England - Version 2.1.xlsx | … | https://friendsoftheearth.uk/nature/green-space-consolidated-data-england |
| LAD_to_region | Local_Authority_District_to_Region__December_2019__Lookup_in_England.csv | used the December 2019 version | https://geoportal.statistics.gov.uk/datasets/3ba3daf9278f47daba0f561889c3521a_0 |
| urban_rural_classification | RUC_MSOA_2001_EW_LU.csv | 2001 was the latest version available | https://geoportal.statistics.gov.uk/datasets/rural-urban-classification-2001-of-msoas-in-england-and-wales |
Ahead of conducting the exploratory data analysis I imported the three datasets and then merged them into the single dataframe shown below.
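A minimal sketch of how the merge could look in base R. The three toy data frames stand in for the imported files, and every column name (`msoa_code`, `la_code`, `region`, `urban_rural`, `green_space_deprivation_rating`) is an assumption for illustration rather than the actual header in the source files.

```r
# Toy stand-ins for the three imported datasets; all column names are assumptions.
green_space <- data.frame(
  msoa_code = c("E02000001", "E02000002", "E02000003"),
  la_code   = c("E09000001", "E09000002", "E09000002"),
  green_space_deprivation_rating = c("E", "C", "A"),
  stringsAsFactors = FALSE
)
LAD_to_region <- data.frame(
  la_code = c("E09000001", "E09000002"),
  region  = c("London", "London"),
  stringsAsFactors = FALSE
)
urban_rural_classification <- data.frame(
  msoa_code   = c("E02000001", "E02000002", "E02000003"),
  urban_rural = c("Urban", "Urban", "Urban"),
  stringsAsFactors = FALSE
)

# all.x = TRUE makes each merge a left join, so every MSOA in the
# Friends of the Earth data is kept even if a lookup has no matching row.
msoa <- merge(green_space, LAD_to_region, by = "la_code", all.x = TRUE)
msoa <- merge(msoa, urban_rural_classification, by = "msoa_code", all.x = TRUE)

# A lookup join should never change the number of MSOAs.
stopifnot(nrow(msoa) == nrow(green_space))
```

Checking the row count after each join is a cheap way to catch duplicated keys in a lookup table, and counting `NA` values in the joined columns shows how many MSOAs failed to match.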
Below I plot the proportion of the `r sum(msoa_by_rating$n_msoa)` MSOAs analysed that received each green space deprivation rating.
Green Space Deprivation in England: understanding the scale of the problem
| Green Space Deprivation Rating | Neighbourhoods (number) | Neighbourhoods (%) | Population (millions) | Population (%) |
|---|---|---|---|---|
| Urgent action needed to improve access to green space | | | | |
| E | 1,108 | 16 | 9.62 | 18 |
| D | 955 | 14 | 8.21 | 15 |
| Total | 2,063 | 30 | 17.84 | 33 |
| Action needed to protect access to green space | | | | |
| C | 1,727 | 25 | 13.54 | 25 |
| B | 1,360 | 20 | 10.77 | 20 |
| A | 1,641 | 24 | 12.58 | 23 |
| Total | 4,728 | 70 | 36.89 | 67 |
Source: Friends of the Earth
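The counts and percentages in a table like the one above can be derived from a one-row-per-MSOA data frame. The sketch below uses a small invented data frame; the column names `rating` and `population`, and the values themselves, are assumptions for illustration only.

```r
# Toy data: one row per MSOA; column names and values are assumptions.
msoa <- data.frame(
  rating     = c("A", "A", "B", "C", "D", "E", "E", "E"),
  population = c(7000, 8000, 7500, 9000, 8500, 8000, 9500, 7500)
)

# Count MSOAs and sum population within each rating (both ordered A to E).
n_msoa <- tapply(msoa$population, msoa$rating, length)
pop    <- tapply(msoa$population, msoa$rating, sum)

msoa_by_rating <- data.frame(
  rating       = names(n_msoa),
  n_msoa       = as.vector(n_msoa),
  pcnt_msoa    = round(100 * as.vector(n_msoa) / sum(n_msoa)),
  pop_millions = as.vector(pop) / 1e6,
  pcnt_pop     = round(100 * as.vector(pop) / sum(pop))
)
msoa_by_rating
```

Because each percentage is rounded separately, the rounded values within a group will not always add up exactly to the rounded group total.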
Where are the regions where action is most needed?
Which regions have the highest/lowest proportion of D and E rated neighbourhoods?
A data-driven approach to classifying neighbourhoods (k means clustering)
Green space deprivation and Covid-19
The demographics of green space deprivation
Dealing with outliers: the boxplot shows outliers beyond Q3 + 1.5*IQR. They are part of the natural variability of the population, so it seems appropriate to retain them, but zoom in on the graphs so that the box itself stays readable.
I am not sure whether or not to filter out the outliers.
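As a rough illustration of the Q3 + 1.5*IQR rule, the sketch below flags outliers without removing them. The values are simulated from a log-normal distribution purely as a stand-in for `green_space_area_per_capita`; they are not the Friends of the Earth data.

```r
set.seed(42)
# Simulated right-skewed values standing in for green_space_area_per_capita.
green_space_area_per_capita <- rlnorm(1000, meanlog = 3, sdlog = 1.5)

# Upper fence: third quartile plus 1.5 times the interquartile range.
q <- quantile(green_space_area_per_capita, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Flag, rather than drop, the points beyond the fence.
is_outlier <- green_space_area_per_capita > upper_fence
sum(is_outlier)

# Every observation is kept; only the plotting range would be restricted.
zoom_limits <- c(0, upper_fence)
```

Flagging keeps the decision reversible: the same data frame can feed both a zoomed plot and a full-range plot.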
So, I wondered if the outliers/very long tail are a result of areas with small populations and/or very large areas of green space.
So, it looks like green space area has much more influence on green space area per capita than population does.
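One way to check which component drives the per capita figure is to compare how much each varies on the log scale, since log(area per capita) = log(area) - log(population). This is a sketch on simulated data; the distributions chosen for `population` and `green_space_area` are assumptions and do not reproduce the real dataset.

```r
set.seed(1)
n <- 500
# Simulated inputs: populations in a narrow band, areas spanning orders of magnitude.
population       <- round(runif(n, min = 5000, max = 12000))
green_space_area <- rlnorm(n, meanlog = 12, sdlog = 2)
green_space_area_per_capita <- green_space_area / population

# log(per capita) = log(area) - log(population),
# so whichever term varies more dominates the result.
var(log(green_space_area))
var(log(population))

cor_area <- cor(log(green_space_area), log(green_space_area_per_capita))
cor_pop  <- cor(log(population), log(green_space_area_per_capita))
```

If MSOA populations sit in a narrow band while green space areas span several orders of magnitude, `cor_area` will be close to 1 and `cor_pop` close to 0.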
So, let’s look at the distribution of green_space_area itself. This is relatively tricky given the wide range of values (as shown in the summary stats). I tried histograms and density plots too, but a box plot seemed the best way to understand the distribution. The first boxplot shows the full distribution and is very difficult to interpret, because the large outliers to the right of the plot compress the box itself into what appears to be a single line. In the second plot the x axis is cropped so that it is straightforward to interpret the box component of the plot. However, this comes at the cost of failing to show the very large outliers within the distribution.
The extreme skew of the distribution can be seen in the summary statistics below. The median for green_space_area is 152,418 m2 while the maximum is 636,087,671 m2.
A similarly extreme right-skewed distribution can be seen for green_space_area_per_capita, as shown in the plots and summary stats below. It is worth noting just how atypical many of the larger outliers are. The median for green_space_area_per_capita is less than 20 m2 per capita, while the maximum is approximately 100,000 m2 per capita.
So, I thought it was worth a quick look at the population density across English MSOAs. The first graph shows the kernel density function for the population density of English MSOAs. Key features of the distribution include:
The second plot groups MSOAs by their FoE green space deprivation rating and highlights:
Plotting population density against green space area, and against green space area per capita, shows how each is associated with population density. Note the log scales on both the x and y axes in both cases.
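A sketch of how population density and its associations could be computed. The data are simulated, and `msoa_area_km2` is a hypothetical land-area column that is not named in the text above, so the numbers it produces say nothing about the real dataset.

```r
set.seed(7)
n <- 500
population       <- round(runif(n, 5000, 12000))
msoa_area_km2    <- rlnorm(n, meanlog = 1, sdlog = 1.5)          # hypothetical column
green_space_area <- msoa_area_km2 * 1e6 * runif(n, 0.01, 0.3)    # m2, simulated

population_density          <- population / msoa_area_km2        # people per km2
green_space_area_per_capita <- green_space_area / population

# Kernel density estimate on the log10 scale, because the raw values are skewed.
dens <- density(log10(population_density))

# Spearman rank correlations are unchanged by taking logs of either variable.
rho_area       <- cor(population_density, green_space_area, method = "spearman")
rho_per_capita <- cor(population_density, green_space_area_per_capita,
                      method = "spearman")
```

Rank correlation is a convenient companion to log-log plots, because it measures the same monotonic association the plots display without depending on the axis transformation.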
From histogram:
pcnt_pop_with_go_space_access is uniform.
pcnt_pop_with_go_space_access and frequency.
From the boxplot with grouping by rating:
Overall it is not clear to what extent pcnt_pop_with_go_space_access is influencing the ratings … would dropping it make much difference to how MSOAs are classified?
A pcnt_pop_with_go_space_access of 75% is used as a cut-off point for some classifications. This figure seems high: it is at approximately the 95th percentile (see calculation below).
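The percentile rank of a cut-off can be computed as the share of MSOAs at or below it. The sketch below uses simulated percentages as a stand-in for `pcnt_pop_with_go_space_access`, so the figure it returns is illustrative and will not match the real data.

```r
set.seed(3)
# Simulated percentages standing in for pcnt_pop_with_go_space_access.
pcnt_pop_with_go_space_access <- 100 * rbeta(2000, shape1 = 1, shape2 = 4)

cut_off <- 75

# Share of MSOAs at or below the cut-off, i.e. its percentile rank.
percentile_of_cut_off <- 100 * mean(pcnt_pop_with_go_space_access <= cut_off)

# The same figure via the empirical cumulative distribution function.
percentile_via_ecdf <- 100 * ecdf(pcnt_pop_with_go_space_access)(cut_off)
```

The reverse lookup, `quantile(pcnt_pop_with_go_space_access, 0.95)`, gives the value that sits at the 95th percentile, which is a useful cross-check.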
Rural-urban classification at LA scale
https://www.gov.uk/government/statistics/local-authority-rural-urban-classification Rural-Urban Classification of Local Authorities Post-2009 Boundaries
Rural-urban classification at MSOA scale
https://geoportal.statistics.gov.uk/datasets/rural-urban-classification-2001-of-msoas-in-england-and-wales urban_rural_classification
Some thoughts on where I am in understanding the FoE ratings and green_space_area:
Looking at green_space_area and green_space_area_per_capita, it doesn’t make sense to me to talk solely about green space deprivation. There are clearly places that are green space affluent … For example, the typical (median) amount of green space area per capita for an MSOA is 20 m2, whilst the MSOA with the most green space per capita has approximately 5,000 times more green space per capita than the typical MSOA. But I guess FoE are following the terminology/approach of the Indices of Multiple Deprivation dataset provided by the ONS.
The typical green_space_area_per_capita for MSOAs in each rating (as shown in the table below) raises a question in my mind, as the amounts of green_space_area_per_capita are relatively similar across ratings E to B … Is it a different experience to live in a neighbourhood with approximately 3 m2 green_space_area_per_capita (as is typical for an E rated MSOA) or with approximately 16 m2 green_space_area_per_capita (as is typical for a B rated MSOA)? Which got me wondering how much green space is enough (Russo and Cirella 2018). It appears that this is a question that hasn’t been the subject of much research to date …
I think it might be actively helpful to remove the right skew from the two green space variables in the context of looking at green space deprivation. Why?
The data from all three variables definitely needs transforming to lie on a scale of 0 to 1, to ensure that the kmeans algorithm applies a roughly equal weight to each variable. When putting the raw data into kmeans, green_space_area_per_capita is the predominant factor in determining clusters, because its values are much larger than those of the other variables …
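A minimal sketch of min-max scaling followed by k-means, using base R's `kmeans()`. The data are simulated, and the choice of these three columns as the clustering variables is an assumption, as is the choice of five clusters to mirror the five FoE ratings.

```r
set.seed(10)
n <- 300
# Simulated stand-ins for the three clustering variables (an assumption).
msoa <- data.frame(
  green_space_area              = rlnorm(n, 12, 2),
  green_space_area_per_capita   = rlnorm(n, 3, 1.5),
  pcnt_pop_with_go_space_access = runif(n, 0, 100)
)

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
msoa_scaled <- as.data.frame(lapply(msoa, rescale01))

# Five clusters to mirror the five ratings; nstart repeats the
# random initialisation and keeps the best of the 20 runs.
clusters <- kmeans(msoa_scaled, centers = 5, nstart = 20)
table(clusters$cluster)
```

k-means works on Euclidean distances, so without rescaling a variable measured in the hundreds of thousands swamps one measured from 0 to 100. Min-max scaling equalises the ranges but does nothing about skew.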
My initial efforts at transforming the data - a log transformation followed by scaling values to the unit interval (i.e. 0…1) - proved rather unsuccessful. See the summary stats below: the transformed values remain tightly grouped around the median.
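The two-step transformation can be sketched as below, again on simulated values. Using `log1p()` rather than `log()` is my assumption, made so that an MSOA with zero green space would not produce `-Inf`; the original transformation may have differed.

```r
set.seed(11)
# Simulated right-skewed values standing in for green_space_area_per_capita.
green_space_area_per_capita <- rlnorm(1000, meanlog = 3, sdlog = 1.5)

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# log1p(x) is log(1 + x), which stays finite when x is zero.
transformed <- rescale01(log1p(green_space_area_per_capita))

summary(transformed)
# Width of the middle half of the data on the 0..1 scale.
IQR(transformed)
```

Comparing `IQR(transformed)` with the full range of 1 gives a quick measure of how tightly the bulk of the values is still packed after the transformation.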
So, I wondered about focusing on a subset of the data which could be easier to work with. Perhaps, given the focus on green space deprivation, it makes sense to remove those clearly green space affluent MSOAs (e.g. those with tens of thousands of m2 of public green space per capita).
The two plots below show, for each region, the number and the proportion of MSOAs with each green space deprivation rating respectively. Key insights from the two plots include:
Below I again plot the proportion of MSOAs within each region receiving each green space deprivation rating, this time faceting the plot by green space deprivation rating rather than by region. This makes it easier to compare across regions at a given rating.
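The within-region proportions behind plots like these can be computed with `table()` and `prop.table()`. The data frame below is invented for illustration, and the column names `region` and `rating` are assumptions.

```r
# Toy data: one row per MSOA; values are invented.
msoa <- data.frame(
  region = c("London", "London", "London", "London",
             "North East", "North East", "North East", "South West"),
  rating = c("E", "E", "D", "A", "C", "B", "A", "A")
)

# Fix the rating order so tables and plots always run A to E.
msoa$rating <- factor(msoa$rating, levels = c("A", "B", "C", "D", "E"))

counts <- table(msoa$region, msoa$rating)

# margin = 1 divides each row by its total: proportions within a region.
props <- prop.table(counts, margin = 1)

# Regions ranked by their share of E rated MSOAs.
sort(props[, "E"], decreasing = TRUE)
```

Using `margin = 2` instead would answer a different question: of all MSOAs with a given rating, what share falls in each region.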
It would be easier to read if I produced separate plots for each green space deprivation rating, as then the regions could be put in rank order.
Next I explored an alternative approach to considering the distribution of MSOAs by rating and region. I plotted the proportion of MSOAs that received a specific rating by region. In order to address the ordering issue above, I produced one plot for each green space rating. This meant addressing the challenge of how to ensure that the colour associated with a given region was applied consistently across the five plots. Here is where I found an approach to doing this, using scale_…_manual.
How to map a colour to a value of a categorical variable …
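The idea behind keeping colours consistent is to build one named vector that maps each region to a colour, and reuse it in every plot. A sketch under assumptions: the region labels are written as I would expect them to appear in the lookup, and the "Dark 3" palette is an arbitrary choice.

```r
# Region labels as assumed to appear in the data.
regions <- c("East Midlands", "East of England", "London", "North East",
             "North West", "South East", "South West", "West Midlands",
             "Yorkshire and The Humber")

# One fixed colour per region; the palette choice is arbitrary.
region_colours <- setNames(
  hcl.colors(length(regions), palette = "Dark 3"),
  regions
)

# Looking colours up by name gives the same answer whatever order
# the regions happen to be plotted in.
region_colours[c("London", "North East")]

# With ggplot2, the named vector would be supplied to each of the five
# plots as, for example, scale_fill_manual(values = region_colours).
```

Because the manual scales match named values to factor levels by name, reordering the regions by rank in each plot does not change which colour a region receives.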
This approach helped me identify some additional insights:
Could do ridgeline plots for each region - for prop rated A and proportion rated E
https://datacarpentry.org/r-raster-vector-geospatial/06-vector-open-shapefile-in-r/
A quick visual inspection of the MSOAs, coloured by their green space deprivation rating, shows a similar pattern across the regions (with the exception of London). The D and E ratings (oranges and reds) occur in the smaller (presumably more densely populated) MSOAs which make up urban areas, while the larger, more rural MSOAs tend to be less green space deprived and have A or B ratings. Given that the whole region of London would probably be considered a continuous urban space, it is unsurprising to observe many green space deprived MSOAs across the region/plot, with relatively few less green space deprived areas present.
Ideas: